WikiART General Exploratory Data Analysis¶
Dataset¶
We're using the WikiART General Dataset.
Processing the Metadata¶
The dataset is composed of .jpg files saved in the format ...\style\artist\title.jpg. The metadata for each image is saved in ...\meta\artist.json. We need to combine all the metadata into one .csv for convenience.
import warnings
warnings.filterwarnings("ignore")
The WikiArtCrawler API accepts the following fields when requesting painting info from the WikiArt API: artist_url, year_start, year_end, media, genre, style, max_aspect_ratio, min_height, min_width. A sample response looks like:
{
  "id": "5a3ed770edc2c978b03992c9",
  "title": "God's Trombone",
  "url": "gods-trombone-1927",
  "artistUrl": "aaron-douglas",
  "artistName": "Aaron Douglas",
  "artistId": "5a3e7ff4edc2c9cfcc4208a7",
  "completitionYear": 1927,
  "width": 321,
  "image": "https://uploads5.wikiart.org/00164/images/aaron-douglas/untitled-4.png",
  "height": 450,
  "detail": {
    "id": "5a3ed770edc2c978b03992c9",
    "title": "God's Trombone",
    "url": "gods-trombone-1927",
    "artistUrl": "aaron-douglas",
    "artistName": "Douglas Aaron",
    "artistId": "5a3e7ff4edc2c9cfcc4208a7",
    "completitionYear": 1927,
    "dictionaries": ["57726b4eedc2cb3880ad6e68", "57726b51edc2cb3880ad74a0", "57726b53edc2cb3880ad78c8"],
    "location": "",
    "period": null,
    "serie": null,
    "genres": ["history painting"],
    "styles": ["Art Deco", "Synthetic Cubism"],
    "media": [],
    "sizeX": null,
    "sizeY": null,
    "diameter": null,
    "galleries": [],
    "tags": ["Text"],
    "description": "",
    "width": 321,
    "image": "https://uploads5.wikiart.org/00164/images/aaron-douglas/untitled-4.png",
    "height": 450
  }
}
""" #meta comes out as a list for each author's works. or rather dictionary?
import pandas as pd
file_path = 'data\\meta\\aaron-douglas.json'
df = pd.read_json(file_path)
df.iloc[-1] """
" #meta comes out as a list for each author's works. or rather dictionary?\nimport pandas as pd\nfile_path = 'data\\meta\\aaron-douglas.json'\ndf = pd.read_json(file_path)\ndf.iloc[-1] "
""" import os
import json """
""" file_path = 'data\\meta\\aaron-douglas.json'
with open(file_path, 'r') as file:
data = json.load(file)
# Extract the "detail" field into a list
details_list = [item.get('detail') for item in data]
# Print the resulting list
print(details_list) """
""" df1 = pd.DataFrame(details_list)
df1 """
""" def extract_detail_from_json(json_file):
with open(json_file, 'r') as file:
data = json.load(file)
# Extract the "detail" field into a list
details = pd.DataFrame([item.get('detail') for item in data])
return details """
""" json_folder_path = 'data\\meta\\'
json_files = [f for f in os.listdir(json_folder_path) if f.endswith('.json')]
df_list = []
for json_file in json_files:
json_file_path = os.path.join(json_folder_path, json_file)
df_details = extract_detail_from_json(json_file_path)
df_list.append(df_details) """
" json_folder_path = 'data\\meta\\'\njson_files = [f for f in os.listdir(json_folder_path) if f.endswith('.json')]\n\ndf_list = []\nfor json_file in json_files:\n json_file_path = os.path.join(json_folder_path, json_file)\n df_details = extract_detail_from_json(json_file_path)\n df_list.append(df_details) "
#df_extract = pd.concat(df_list)
#df_extract.head(2)
#df_extract.to_csv('output.csv', index=False)
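The string-literal cells above were run once and then disabled. Their core step can be consolidated into one pure helper that works on already-loaded records (a sketch; the sample record below is abbreviated from the API response shown earlier):

```python
import pandas as pd

def details_to_frame(records):
    """Flatten the 'detail' field of a list of WikiArt records into one DataFrame."""
    return pd.DataFrame([r.get('detail') for r in records])

# One record shaped like the API response shown earlier (keys abbreviated)
sample = [{"id": "5a3ed770edc2c978b03992c9",
           "detail": {"title": "God's Trombone", "artistName": "Aaron Douglas",
                      "completitionYear": 1927, "width": 321, "height": 450}}]
df = details_to_frame(sample)
```

A pure function like this is easier to test than one that opens files itself; the file loop only needs `json.load` plus `pd.concat` over the per-artist frames.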
We have amalgamated the raw metadata into the output.csv file.
Loading Metadata¶
import pandas as pd
df_data = pd.read_csv('output.csv')
df_data.head()
| id | title | url | artistUrl | artistName | artistId | completitionYear | dictionaries | location | period | ... | media | sizeX | sizeY | diameter | galleries | tags | description | width | image | height | |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 5772882cedc2cb388009739f | In Baghdad | in-baghdad | 3d | 3D | 57726defedc2cb3880b54026 | NaN | ['57726b52edc2cb3880ad7898', '57726b4eedc2cb38... | NaN | NaN | ... | [] | NaN | NaN | NaN | [] | ['Fictional character'] | NaN | 300 | https://uploads1.wikiart.org/images/3d/in-bagh... | 432 |
| 1 | 5772882cedc2cb38800973ef | Untitled (Headz) | untitled-headz | 3d | 3D | 57726defedc2cb3880b54026 | NaN | ['57726b52edc2cb3880ad7898', '57726b4eedc2cb38... | NaN | NaN | ... | [] | NaN | NaN | NaN | [] | ['Orange'] | NaN | 350 | https://uploads5.wikiart.org/images/3d/untitle... | 511 |
| 2 | 5772882cedc2cb38800973bf | No Great Crime | no-great-crime-1983 | 3d | 3D | 57726defedc2cb3880b54026 | 1983.0 | ['57726b52edc2cb3880ad7898', '57726b4eedc2cb38... | NaN | NaN | ... | [] | NaN | NaN | NaN | [] | ['Font'] | NaN | 750 | https://uploads2.wikiart.org/images/3d/no-grea... | 494 |
| 3 | 5772882cedc2cb388009737d | 3D | 3d-1984 | 3d | 3D | 57726defedc2cb3880b54026 | 1984.0 | ['57726b52edc2cb3880ad7898', '57726b4eedc2cb38... | NaN | NaN | ... | [] | NaN | NaN | NaN | [] | [] | NaN | 564 | https://uploads7.wikiart.org/images/3d/3d-1984... | 400 |
| 4 | 5772882cedc2cb388009740f | Wild Bunch | wild-bunch-1984 | 3d | 3D | 57726defedc2cb3880b54026 | 1984.0 | ['57726b52edc2cb3880ad7898', '57726b4eedc2cb38... | NaN | NaN | ... | [] | NaN | NaN | NaN | [] | ['Text', 'Font'] | NaN | 647 | https://uploads6.wikiart.org/images/3d/wild-bu... | 400 |
5 rows × 23 columns
Cleaning the Metadata¶
df_data.shape
(152475, 23)
We have 152,475 items in the metadata, but fewer actual images. We need to cross-reference and keep only the entries that have actual image files.
import os
import zipfile
# Specify the directory where your zip archives are located
zip_directory = 'data\\'
# List to store extracted titles for all styles
all_titles = []
# Iterate through zip archives in the specified directory
for zip_filename in os.listdir(zip_directory):
    # Build the full path to the archive
    zip_full_path = os.path.join(zip_directory, zip_filename)
    # Process only files that end with ".zip"
    if os.path.isfile(zip_full_path) and zip_filename.lower().endswith('.zip'):
        with zipfile.ZipFile(zip_full_path, 'r') as zip_ref:
            # Record the title of every .jpg listed in the archive (nothing is extracted)
            for file_info in zip_ref.infolist():
                filename = file_info.filename
                if filename.lower().endswith('.jpg'):
                    title = os.path.splitext(os.path.basename(filename))[0]
                    all_titles.append(title)
len(all_titles)
106636
print(all_titles[100000])
the-angels-appearing-to-the-shepherds-1809
So there is a difference of about 50,000 images (empty folders). We should cross-reference with the metadata and keep only entries present in both. The image filenames correspond to the url column of the metadata.
df_combine = df_data[df_data['url'].isin(all_titles)].copy()  # .copy() avoids SettingWithCopyWarning on later edits
df_combine.shape
(93361, 23)
So about 13,000 images have no matching metadata entry. We'll drop those as well (for EDA).
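The inverse check, image files whose titles never appear in the metadata's url column, is a set difference (a sketch on toy data; the names are illustrative):

```python
# Toy stand-ins for the real all_titles list and the metadata's url column
all_titles = ['in-baghdad', 'untitled-headz', 'orphan-file']
meta_urls = {'in-baghdad', 'untitled-headz', 'no-great-crime-1983'}

# Image files with no metadata row -- exactly what the isin() filter silently drops
orphans = sorted(set(all_titles) - meta_urls)
```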
Cleaning the Dataset¶
print(df_combine.columns.values.tolist())
['id', 'title', 'url', 'artistUrl', 'artistName', 'artistId', 'completitionYear', 'dictionaries', 'location', 'period', 'serie', 'genres', 'styles', 'media', 'sizeX', 'sizeY', 'diameter', 'galleries', 'tags', 'description', 'width', 'image', 'height']
df_combine['media'].unique()[0:5]
array(['[]', "['oil']", "['bronze', 'marble']", "['bronze']",
"['marble']"], dtype=object)
df_combine['location'].unique()[0:5]
array([nan, 'Italy', 'NaplesItaly', 'France', 'Albania'], dtype=object)
df_combine['period'].unique()[0:5]
array([nan,
"{'id': '5784c060edc2cb202cf4cd94', 'title': 'National Armenian Motive'}",
"{'id': '57726d8aedc2cb3880b499c1', 'title': 'Italian Period'}",
"{'id': '57726d8aedc2cb3880b499c3', 'title': 'Dresden Period'}",
"{'id': '57726d8aedc2cb3880b499c5', 'title': 'Vienna Period'}"],
dtype=object)
df_combine['galleries'].unique()
array(['[]', "['Private Collection']",
"['Museum of Modern Art (MoMA), New York City, NY, US']", ...,
"['Foundling Museum, London, UK']",
"['Keble College, Oxford, UK']",
"['Abbot Hall Art Gallery, Kendal, UK']"], dtype=object)
For the purposes of EDA we will drop most of the columns, since many have missing values or provide duplicate information. We should note that the description and serie columns would be useful for the image-to-text and language model portions of the project.
df_combine.drop(columns=['id','url','artistUrl','artistId','dictionaries','location','period','serie','sizeX','sizeY','diameter','galleries','image','description'],inplace=True)
df_combine.head(3)
| title | artistName | completitionYear | genres | styles | media | tags | width | height | |
|---|---|---|---|---|---|---|---|---|---|
| 58 | Gloucester 16A | Siskind Aaron | 1944.0 | ['photo'] | ['Abstract Expressionism'] | [] | ['monochrome'] | 446 | 600 |
| 59 | New York City W 1 | Siskind Aaron | 1947.0 | ['photo'] | ['Abstract Expressionism'] | [] | [] | 472 | 600 |
| 60 | New York 2 | Siskind Aaron | 1948.0 | ['photo'] | ['Abstract Expressionism'] | [] | ['Photograph', 'Monochrome photography', 'mono... | 672 | 600 |
Some data belongs to multiple styles. The WikiArtCrawler handles this by selecting the last element in the list of styles.
import plotly.io as pio
pio.renderers.default = 'notebook'
import plotly.express as px
from plotly.offline import init_notebook_mode
init_notebook_mode(connected=False)
import ast
df_combine['styles'] = df_combine['styles'].apply(ast.literal_eval) #Convert from string to list
df_combine['styles_length'] = df_combine['styles'].apply(len)
fig1 = px.pie(df_combine, names='styles_length', title='Proportion of Data by Number of Styles',
labels={'styles_length': 'Number of Styles'},
hole=0.3)
fig1.update_layout(width=400, height=400)
fig1.show()
We should drop the rows without styles and, for the multi-style entries, keep only the last element of the styles list. (At least, that appears to be what the WikiArtCrawler author did.)
df_combine.drop(df_combine[df_combine['styles_length'] == 0].index, inplace=True)
df_combine['styles'] = df_combine.apply(lambda row: [row['styles'][-1]] if row['styles_length'] > 1 else row['styles'], axis=1)
df_combine['styles_length'] = df_combine['styles'].apply(len)
df_combine['styles_length'].describe()
df_combine.drop(columns=['styles_length'],inplace=True)
df_combine.isnull().sum(axis = 0)
title                   0
artistName            557
completitionYear    21838
genres                  0
styles                  0
media                   0
tags                    0
width                   0
height                  0
dtype: int64
We should also drop any rows with a missing artistName.
df_combine.dropna(subset=['artistName'], inplace=True)
Distribution of Data¶
Artist Representation¶
artist_works_count = df_combine['artistName'].value_counts().reset_index()
print(artist_works_count)
                      artistName  count
0               van Gogh Vincent   1918
1               Roerich Nicholas   1838
2          Renoir Pierre-Auguste   1408
3                   Monet Claude   1358
4     Piranesi Giovanni Battista   1352
...                          ...    ...
1972                 Froud Brian      1
1973           Rafael Soto Jesus      1
1974      Langlois Jérôme-Martin      1
1975            Hiratsuka Unichi      1
1976                    Bury Pol      1

[1977 rows x 2 columns]
Fun fact: van Gogh reportedly produced a new work roughly every 36 hours over his art career (ages 27 to 37). That's why he has 1,918 works here.
# Get the artist works count
artist_works_count = df_combine['artistName'].value_counts().reset_index()
# Rename the columns for clarity
artist_works_count.columns = ['artistName', 'works_count']
# Define bins
bins = [0, 10, 20, 30, 40, 50, 60, 70, 80, 90, 100, 500, float('inf')]
# Bin the works count and create a new 'works_bin' column
artist_works_count['works_bin'] = pd.cut(artist_works_count['works_count'], bins=bins, labels=[f'{x}-{y}' for x, y in zip(bins[:-1], bins[1:])], include_lowest=True)
# Convert 'Interval' objects to strings
artist_works_count['works_bin'] = artist_works_count['works_bin'].astype(str)
# Sort the 'works_bin' column to ensure correct order
order = [f'{x}-{y}' for x, y in zip(bins[:-1], bins[1:])]
artist_works_count['works_bin'] = pd.Categorical(artist_works_count['works_bin'], categories=order, ordered=True)
artist_works_count = artist_works_count.sort_values('works_bin')
# Group by 'works_bin' and calculate proportions
total_artists = len(artist_works_count)
binned_counts = artist_works_count.groupby('works_bin').size().reset_index(name='artist_count')
binned_counts['proportion'] = binned_counts['artist_count'] / total_artists
# Plotting with Plotly
fig_artist = px.bar(binned_counts, x='works_bin', y='proportion',
                    labels={'works_bin': 'Number of Works', 'proportion': 'Proportion of Artists'},
                    title='Works by Artist')
fig_artist.update_layout(width=800, height=500)
fig_artist.show()
So we don't have to worry about any one artist being over-represented: most artists in the dataset contribute only a handful of works.
Completion Year¶
fig_completion_year = px.histogram(df_combine, x="completitionYear", nbins=30, title="Distribution of Completion Year")
fig_completion_year.update_layout(width=800, height=500, xaxis_title="Completion Year",bargap=0.30)
fig_completion_year.show()
The majority of the art comes from fairly recent art movements; there might be an issue with copyright. Most of the works fall in the 100 years between 1850 and 1950. Maybe we should consider having a more balanced time representation?
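One way to get a more balanced time representation would be to cap the number of works per decade (a sketch on toy data; the cap of 2 is arbitrary):

```python
import pandas as pd

# Toy completion years: four works from the 1850s, two from the 1950s
df = pd.DataFrame({'completitionYear': [1851, 1852, 1853, 1854, 1951, 1952]})
df['decade'] = (df['completitionYear'] // 10) * 10

# Keep at most `cap` works per decade, sampled without replacement
cap = 2
balanced = (df.groupby('decade', group_keys=False)
              .apply(lambda g: g.sample(n=min(len(g), cap), random_state=0)))
```

Sampling (rather than taking the first rows of each decade) avoids biasing toward whichever artists happen to sort first.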
Genres¶
# Distribution of Genres
# Parse the string representation into real lists, then explode (one row per genre)
df_genres_expanded = df_combine.copy()
df_genres_expanded['genres'] = df_genres_expanded['genres'].apply(ast.literal_eval)
df_genres_expanded = df_genres_expanded.explode('genres')
df_genres_expanded['genres'] = df_genres_expanded['genres'].fillna('no genre')  # empty lists explode to NaN
top_genres = df_genres_expanded['genres'].value_counts().nlargest(5).index
fig_genres_pie = px.pie(df_genres_expanded[df_genres_expanded['genres'].isin(top_genres)], names="genres", title="Distribution of Genres (Top 5)")
fig_genres_pie.update_layout(width=500, height=500)
fig_genres_pie.show()
print(df_genres_expanded['genres'].nunique())
352
There are a lot of genres, but three dominate (portrait, landscape, genre painting), and the single most prevalent value is actually no genre at all. If we use this field we should work with a subset of our dataset, since only the better-known artworks have genres.
Media¶
media_list = df_combine['media']

# Collect individual media entries (each cell is a string like "['oil']")
individual_media = []
for media in media_list:
    media = ast.literal_eval(media)  # safer than eval for parsing the list literal
    individual_media.extend(media)

# Convert the list to a DataFrame and count frequencies
result_df = pd.DataFrame({'Media': individual_media})
media_counts = result_df['Media'].value_counts()

# Keep only media appearing at least 1000 times, in descending frequency order
filtered_media = media_counts[media_counts >= 1000]
sorted_media = filtered_media.index
filtered_df = result_df[result_df['Media'].isin(sorted_media)]
# Create a histogram using Plotly Express
fig_media = px.histogram(filtered_df, x='Media', title='Media Frequency (Counts >= 1000)', category_orders={'Media': sorted_media})
# Show the plot
fig_media.update_layout(width=500, height=500)
fig_media.show()
The most common substrate is canvas and the most common medium is oil. Oil-on-canvas paintings usually have a lot of physical depth, which is why the technique was favoured; our images likely cannot capture that topography. Additionally, many works don't list a medium, just as with genres: only the more popular works carry any information beyond their art movement. I highly doubt our model will be able to detect differences in media, in which case high resolution is less important.
Tags¶
tags_list = df_combine['tags']

# Collect individual tags (each cell is a string like "['Text', 'Font']")
individual_tags = []
for tags in tags_list:
    tags = ast.literal_eval(tags)  # safer than eval for parsing the list literal
    individual_tags.extend(tags)

# Convert the list to a DataFrame and count frequencies
result_df = pd.DataFrame({'Individual Tags': individual_tags})
tag_counts = result_df['Individual Tags'].value_counts()

# Keep only tags appearing at least 1000 times, in descending frequency order
filtered_tags = tag_counts[tag_counts >= 1000]
sorted_tags = filtered_tags.index
filtered_df = result_df[result_df['Individual Tags'].isin(sorted_tags)]
# Create a histogram using Plotly Express
fig_tags = px.histogram(filtered_df, x='Individual Tags', title='Tag Frequency (Counts >= 1000)', category_orders={'Individual Tags': sorted_tags})
# Show the plot
fig_tags.show()
There are a lot of blank tags as well, but the tags that do exist describe semantic features our model might detect.
Styles (Art Movement)¶
# Flatten the lists in the 'styles' column
df_combine['styles'] = df_combine['styles'].apply(lambda x: x[0] if isinstance(x, list) else x)
# Count the occurrences of each style
style_counts = df_combine['styles'].value_counts().head(15)
# Create a new DataFrame with only the top 15 styles
df_top_styles = pd.DataFrame({'styles': style_counts.index, 'count': style_counts.values})
# Plotting the histogram using Plotly Express
fig_style = px.bar(df_top_styles, x='styles', y='count', title='Styles (Art Movement)')
# Show the plot
fig_style.update_layout(width=500, height=500)
fig_style.show()
The dataset isn't well balanced between art movements. Honestly, we might be better off calling the WikiArt API ourselves instead of using WikiArtCrawler. The WikiArt website lists the total number of artworks and the number of artists per movement, which we can use as a starting point if we do draw our own data.
Aspect Ratio¶
# Calculate aspect ratios
df_combine['aspect_ratio'] = df_combine['width'] / df_combine['height']
# Plot the aspect ratios using Plotly Express
fig_ratio = px.histogram(df_combine, x='aspect_ratio', nbins=100, title='Aspect Ratios Distribution')
fig_ratio.update_layout(xaxis_title='Aspect Ratio', yaxis_title='Count',width=500, height=500)
fig_ratio.update_xaxes(range=[0, 4])
# Calculate median
median_value = df_combine['aspect_ratio'].median()
# Add median line
fig_ratio.add_shape(type='line',
x0=median_value, x1=median_value,
y0=0, y1=30000,
line=dict(color='red', width=2),
name='Median')
# Show the plot
fig_ratio.show()
Most of the images are near a 1:1 aspect ratio, so we won't have to worry too much about it. We still need to decide whether to crop to size, add padding, or stretch our images.
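Since most images are already near square, padding to square discards the least information. A minimal sketch of the arithmetic with NumPy (real preprocessing would more likely use PIL or torchvision transforms):

```python
import numpy as np

def pad_to_square(img):
    """Zero-pad an H x W x C array so H == W, keeping the image centred."""
    h, w = img.shape[:2]
    size = max(h, w)
    top = (size - h) // 2
    left = (size - w) // 2
    out = np.zeros((size, size, img.shape[2]), dtype=img.dtype)
    out[top:top + h, left:left + w] = img
    return out

img = np.ones((432, 300, 3), dtype=np.uint8)  # a portrait-shaped dummy image
square = pad_to_square(img)
```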
# Calculate resolution
df_combine['resolution'] = df_combine['width'] * df_combine['height']
# Plot the resolution using Plotly Express
fig_resolution = px.histogram(df_combine, x='resolution', nbins=30, title='Resolution Distribution')
fig_resolution.update_layout(xaxis_title='Resolution', yaxis_title='Count')
# Show the resolution plot
fig_resolution.show()
Looking at the Images¶
Using color.py to find the average colour of each image and saving it to result.csv, we can see if there's any relationship between the average colour of an image to its artist or style.
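color.py itself isn't reproduced here; its core computation is presumably just a per-channel mean over all pixels. A sketch on a synthetic array rather than a real .jpg:

```python
import numpy as np

def average_colour(img):
    """Mean RGB of an H x W x 3 array, truncated to an int tuple."""
    return tuple(int(c) for c in img.reshape(-1, 3).mean(axis=0))

# Synthetic 2x2 image: two pure-red pixels and two pure-blue pixels
img = np.array([[[255, 0, 0], [255, 0, 0]],
                [[0, 0, 255], [0, 0, 255]]], dtype=np.uint8)
avg = average_colour(img)
```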
import pandas as pd
image_df = pd.read_csv('result.csv')
# Paths in the 'Image' column look like:
#   abstract_expressionism\ljubicast.jpg\abstract_expressionism/nemanja-vuckovic/ljubicast.jpg
# Define a function to extract the artist name from an image path
def extract_artist_name(file_path):
    # Normalize the path separator to '/'
    file_path = file_path.replace("\\", "/")
    # Split the path by '/' and take the second-to-last part (the artist folder)
    parts = file_path.split('/')
    artist_name = parts[-2]
    return artist_name
# Apply the function to the DataFrame column and create a new column
image_df['artistName'] = image_df['Image'].apply(extract_artist_name)
image_df.drop(columns=['Image'], inplace=True)
def rgb_to_hex(rgb):
    # Convert values to integers
    r = int(rgb[0])
    g = int(rgb[1])
    b = int(rgb[2])
    # Format the RGB values as a hex string
    return "#{:02x}{:02x}{:02x}".format(r, g, b)
# Convert string tuples to actual tuples of integers
image_df['AverageColor'] = image_df['AverageColor'].apply(lambda x: tuple(map(int, x.strip('()').split(','))))
# Apply the rgb_to_hex function to create the 'HexColor' column
image_df['HexColor'] = image_df['AverageColor'].apply(rgb_to_hex)
image_df
| Style | Artist | AverageColor | artistName | HexColor | |
|---|---|---|---|---|---|
| 0 | abstract_expressionism | ljubicast.jpg | (86, 74, 113) | nemanja-vuckovic | #564a71 |
| 1 | abstract_expressionism | thorns.jpg | (197, 177, 57) | nemanja-vuckovic | #c5b139 |
| 2 | abstract_expressionism | landscape.jpg | (132, 138, 143) | nemanja-vuckovic | #848a8f |
| 3 | abstract_expressionism | mislead.jpg | (141, 140, 144) | nemanja-vuckovic | #8d8c90 |
| 4 | abstract_expressionism | zeal.jpg | (125, 54, 53) | nemanja-vuckovic | #7d3635 |
| ... | ... | ... | ... | ... | ... |
| 106631 | symbolism | standing-warrior.jpg | (210, 208, 213) | ferdinand-hodler | #d2d0d5 |
| 106632 | symbolism | standing-warrior-1.jpg | (216, 214, 218) | ferdinand-hodler | #d8d6da |
| 106633 | symbolism | giulia-leonardi-1910.jpg | (110, 104, 98) | ferdinand-hodler | #6e6862 |
| 106634 | symbolism | view-of-the-horn-of-fromberg-from-reichenbach-... | (156, 137, 92) | ferdinand-hodler | #9c895c |
| 106635 | symbolism | the-forest-near-reichenbach-1903.jpg | (105, 86, 72) | ferdinand-hodler | #695648 |
106636 rows × 5 columns
Average Colour Per Artist¶
# 'AverageColor' now contains tuples of ints, e.g. (86, 74, 113)
import numpy as np
# Define a custom aggregation function to calculate the mean RGB values
def calculate_mean_rgb(group):
    # Convert the tuples in 'AverageColor' to NumPy arrays for easier arithmetic
    rgb_arrays = group['AverageColor'].apply(np.array)
    # Calculate the mean RGB values
    mean_rgb = np.mean(rgb_arrays, axis=0)
    # Convert the mean RGB values back to integers
    mean_rgb = mean_rgb.astype(int)
    # Return the mean RGB values as a tuple
    return tuple(mean_rgb)
# Group by 'artistName' and apply the custom aggregation function
average_colors_per_artist = image_df.groupby('artistName').apply(calculate_mean_rgb)
# Convert the mean RGB values to hex color codes
average_colors_per_artist_hex = average_colors_per_artist.apply(rgb_to_hex)
# Create a new DataFrame with the results
result_df = pd.DataFrame({
'artistName': average_colors_per_artist.index,
'AverageColorHex': average_colors_per_artist_hex.values
})
# Display the result DataFrame
print(result_df)
                 artistName AverageColorHex
0             aaron-siskind         #605f5d
1     abdullah-suriosubroto         #6c6859
2               abidin-dino         #938575
3            abraham-storck         #857b66
4             achille-dorsi         #a39f9a
...                     ...             ...
1225        yuriy-zlotnikov         #c2b8a8
1226             yves-laloy         #716b68
1227         zakar-zakarian         #352211
1228     zdzislaw-beksinski         #817465
1229   zinaida-serebriakova         #877468

[1230 rows x 2 columns]
# Filter rows with count greater than 1
filtered_df = result_df[result_df.groupby('AverageColorHex')['AverageColorHex'].transform('count') > 1].reset_index()
# Plotting histogram using Plotly
# Plot using Plotly Express
fig = px.bar(filtered_df, x='artistName', color='AverageColorHex', color_discrete_sequence=filtered_df['AverageColorHex'],title='Shared (Count>1) Average Colours by Artist')
fig.update_layout(xaxis_title='Artist', yaxis_title='',bargap=0)
fig.show()
There is very little overlap in average colour per artist; the artists shown above are the only such cases in the entire dataset.
Average Colour per Style¶
# 'AverageColor' contains tuples of ints, e.g. (86, 74, 113)
# Group by 'Style' and apply the custom aggregation function
average_colors_per_style = image_df.groupby('Style').apply(calculate_mean_rgb)
# Convert the mean RGB values to hex color codes
average_colors_per_style_hex = average_colors_per_style.apply(rgb_to_hex)
# Create a new DataFrame with the results
result2_df = pd.DataFrame({
'Style': average_colors_per_style.index,
'AverageColorHex': average_colors_per_style_hex.values
})
# Display the result DataFrame
print(result2_df)
                         Style AverageColorHex
0       abstract_expressionism         #968678
1                      baroque         #6f6256
2               ecole_de_paris         #867768
3                expressionism         #857665
4                impressionism         #847664
5        naive_art_primitivism         #7e7361
6            neo_impressionism         #8c7f6c
7                neoclassicism         #67584a
8           post_impressionism         #857966
9   pre_raphaelite_brotherhood         #806e5b
10                     realism         #796e5e
11                      rococo         #635344
12                 romanticism         #7d7060
13                  surrealism         #857a6d
14                   symbolism         #85786b
# Plot using Plotly Express
fig = px.bar(result2_df, x='Style', color='AverageColorHex', color_discrete_sequence=result2_df['AverageColorHex'],title='Average Colours by Style')
fig.update_layout(xaxis_title='Style', yaxis_title='', bargap=0.1)
# Show the plot
fig.show()
Some shade of brown is the most common average colour for most of the paintings. This is expected: pigments degrade over time toward brown. (Art from the Dutch Golden Age, for example, had bright pigments thanks to the VOC bringing materials back from Asia, but those pigments have degraded greatly over time.) Additionally, older paints were imperfect mixtures that blended toward a brownish paint, and brown was also used deliberately in backgrounds as a foil for the subject of an artwork. So there are several reasons why the paintings converge to different shades of brown.
Summary¶
We should decide whether to accept the dataset as is. There are a few issues with its distribution and representation, notably the lack of Renaissance works, which include some of the most well-known works of art.
- Input Image
- Crop/Pad/Segment
- Resolution
- Balanced Representation
- Completion Year
- Art Movement
- Genre
- Media
- We can remove works of art that aren't standard paintings?
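The last point could be implemented as a media filter: keep only rows whose media list intersects a whitelist of painting media (a sketch on toy rows; the whitelist is an assumption, not something defined by the dataset):

```python
import pandas as pd

# Hypothetical whitelist of painting media -- an assumption, not from the dataset
PAINT_MEDIA = {'oil', 'tempera', 'watercolor', 'acrylic', 'gouache'}

df = pd.DataFrame({'title': ['a', 'b', 'c'],
                   'media': [['oil'], ['bronze', 'marble'], []]})

# Keep rows whose media list intersects the whitelist; empty media lists are dropped too
paintings = df[df['media'].apply(lambda m: bool(PAINT_MEDIA & set(m)))]
```

Note that, as observed above, many works have an empty media list, so this filter would also discard paintings that simply weren't annotated.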